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Why your model parameter confidences might be too optimistic - 
unbiased estimation of the inverse covariance matrix 
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ABSTRACT 



Aims. The maximum-likelihood method is the standard approach to obtain model fits to observational data and the corresponding confidence 
regions. We investigate possible sources of bias in the log-likelihood function and its subsequent analysis, focusing on estimators of the inverse 
covariance matrix. Furthermore, we study under which circumstances the estimated covariance matrix is invertible. 

Methods. We perform Monte-Carlo simulations to investigate the behaviour of estimators for the inverse covariance matrix, depending on the 
number of independent data sets and the number of variables of the data vectors. 

Results. We find that the inverse of the maximum-likelihood estimator of the covariance is biased, the amount of bias depending on the ratio 
of the number of bins (data vector variables), p, to the number of data sets, n. This bias inevitably leads to an - in extreme cases catastrophic 
- underestimation of the size of confidence regions. We report on a method to remove this bias for the idealised case of Gaussian noise and 
statistically independent data vectors. Moreover, we demonstrate that marginalisation over parameters introduces a bias into the marginalised 
log-likelihood function. Measures of the sizes of confidence regions suffer from the same problem. Furthermore, we give an analytic proof for 
the fact that the estimated covariance matrix is singular if p > n. 
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1. Introduction 



The maximum-likelihood method (e.g. 'Barlow' '1991') IS com- 
mon practice to obtain the best-fit parameters ttq and confi- 
dence regions from a measured data vector d €W for a model 
m{7r). It usually consists of finding the maximum of the log- 
likelihood function 

£{d\7:) oc-i[d- m(n)f 1^ [d - m(n)] , (1) 

where tt is the parameter vector and a Gaussian distribution of 
the measurement eiTors is assumed. The confidence regions for 
the maximum-likelihood fit are then defined by the surfaces of 
constant AX = Xmax - -C, where Xmax is the maximum value of 
the log-likelihood function. 

For the evaluation of the log-likelihood the population co- 
variance matrix I and its inverse Z"' or estimates thereof are 
needed. In most cases, no exact analytical expression for Z 
can be given, although numerous authors make use of ana- 
lytical approximations. An example from the field of weak 



gravitational lensing is ISemboIoni et al. I (l2006h . who use the 



Gaussian approximation to the covariance matr i x of th e shear 
correlation functions given by ISchneider et al.l (l2002h . Other 



* Founded by merging of the Stemwarte, Radioastronomisches 
Institut and Institut fiir Astrophysik und Extraterrestrische Forschung 
der Universitat Bonn 
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ossibiliti es are to estimate Z from the data themselves (e.g. 



Hetterscheidt et alj|2006: .Budavari et al.. ,2003.) or to obtain it 
from a simulated data set whose properties are comparable to 
the original data (e.g. P an & Szapudi 2005*). In the latter pa- 
per, the authors observed that the estimated covariance ma- 
trix becomes singular if p, the number of entries of the data 
vectors, exceeds the number of observations / simulated data 
vectors. As a remedy, t hey propose to us e the Singular Value 
Decomposition (SVD, iPress et al.lll992l) to obtain a pseudo- 
inverse of the covariance matrix, but do not investigate the 
properties of the resulting estimate of Z ' in detail. In this pa- 
per, we prove analytically that the rank of the standard estima- 
tor of the covariance matrix cannot exceed the number of obser- 
vations. We then point out that, even if this estimator is not sin- 
gular, simple matrix inversion yields a biased estimator of Z ' . 
This may, if not corrected for, cause a serious underestimate of 
the size of the confidence regi ons. This problem has also bee n 
noticed by iHirata et"aD (l2004 and iMandelbaum et alJ (l2006t) . 
who use Monte-Carlo simulations to determine the correct con- 
fidence contours in cases where the covariance matrix is noisy. 
We report on the existence of a simpler method to remove this 
bias, which can be derived for Gaussian noise and statistically 
independent data vectors, and test the validity of this method 
when these assumptions are violated. 
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2. The covariance matrix 

2.1. Estimators 

Let be a vector of p random variables with components di, 
drawn from a multi-variate Gaussian distribution with popula- 
tion covariance matrix I and mean fi: 



P(d) 



1 



(2) 



(2;r)^'/2 Vdetl I 2 

Furthermore, let rf*^*' denote the A;-th reaUsation of this random 
vector, where k e [l,n] and n is the total number of realisa- 
tions. The well-known maximum-likelihood est imator for the 
components of the covariance matrix is given by ( Barlowll991 ) 



(3) 



which in the case of a known mean vector /i is unbiased. If, 
however, fi has to be estimated from the data, a correction factor 
of n/(n - 1) has to be applied to 0. 

2.2. The rank of C^^ 

In the following, we prove that is singular for p > nin case 
of known mean vector, and for p > n - 1 if the mean vector is 
obtained from the data as well. For the first case, this can be 
seen by rewriting (O as 



(4) 



where we presume, without loss of generality, that the mean 
vector is zero. Since the data vectors d^''^ are statistically 
independent, we can safely assume that they are linearly 
independent for n < p (for a continuous distribution, the 
probability to draw linearly dependent data vectors is zero). 
Therefore, span an «-dimensional subspace U ofW. To 

check whether is singular we now try to find a vector 
y for which &^^y = 0. Looking at (|4]l, we see that this 
is only possible for p > n, since in this case we can always 
choose a vector y from the subspace orthogonal to U, for 
which ■ J = V ;t. If /7 < n, already spans the whole 
of W, and no vector can be found that is orthogonal to all rf*^**. 
This proves that C^^ is singular for known mean vector if 
p > n. 

We now prove our statement for an unknown mean vector 
fi, which is estimated from the data using 



1 " 



(5) 



For this, we define a new set of independent data vectors jw**^*} 

by forming linear combinations of {rf*^**}, specified by the or- 
thogonal transformation B, of which we d emand that the l ast 
(n-th) row be given 

by(l/Vn, ■■• ■l/A/»') (lAndersonl2003h : 



w 



(k) 



(6) 



Thanks to our choice of B„/, we have h"*^"' = ■\/nfi. Next, we 
rewrite C^^ by means of the new data vectors: 



1 " 

- -y 

n-l 

- -y 



(7) 
(8) 
(9) 



The last expression is of the same form as (|4]i (except for the 
sum, which has one addend less), and so the same line of 
reasoning as above can be applied to show that C^'' is singular 
for p > n - \ . 

Another interesting implication of Eq. Q is that the mean 
vector and the estimated covariance matrix are distributed inde- 



pendently (again see I Anderson 2003 ). although they are com- 
puted from the same data vectors. First, note that w*'' and w*^* 
are statistically independent for / j. This can be seen by com- 
puting the covariance between the two vectors: 



Cov(w®,H'(^>) = ((w«-V«)( 



= ^ B,J;Bj/5j:/Z 



k.l=l 



Here, (■) denotes the expectation value and (see Eq.|6l) 



(10) 
(11) 

(12) 
(13) 

(14) 



is the mean value of h"*^''. 

Since does not depend on w^"'', which in turn is statis- 
tically independent of the remaining h"*^'', this shows the inde- 
pendence of estimated mean and covariance. 

3. The inverse covariance matrix 

3.1. An unbiased estimator for 

From (O, an estimator for Z"' can be obtained by matrix inver- 
sion: 



(15) 



This estimator is consistent, but not unbiased due to noise in 
the inverse of an unbiased estimator for some statisti- 
cal variable X is in general not an unbiased estimator for X^K 
Indeed, in our case of Gaussian eiT ors and sta tistically inde- 
pendent data vectors one can show ( AndersonI 12003 ) that the 
expectation value of C;^' is not the inverse of the population 
covariance, but 



/=i 



N 



N- 



1 



I"' for /9 < - 1 , 



(16) 
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Fig. 1. Ratios of the trace of I ' to the traces of C, ' (triangles) 
and C ' (squares), respectively. The dashed line is for the co- 
variance model ( fTSl ). the solid line for ( fT9] ) and the dot-dashed- 
line for (l20l i. The original data vectors had p\ - 240 bins, and 
were rebinned by subsequently joining 2, 3, ... of the original 
bins. The number of independent observations is « = 60. Error 
bars are comparable to the symbol size and therefore omitted. 

where - n if ^ is known and N - n - \ if the mean is 
estimated from the data. In the following, we will only pursue 
the latter case. 

The amount of bias in C;,:' thus depends essentially on the 
ratio of the number of entries p in the data vectors (henceforth 
referred to as the number of bins) to the number of independent 
observations n. It leads to an underestimation of the size of con- 
fidence regions by making the log-likelihood function steeper, 
but it does not change the maximum-likelihood point and thus 
does not affect the parameter estimates themselves. 

From ( fTSI l it follows that an unbiased estimator of I"' is 
given by' 

C-'^ "~^~^ C;'forp<«-2. (17) 
n - 1 

3.2. Monte-Carlo experiments 

To illustrate Eq. ST% . and also to probe how the pseudo-inverse 
of the estimated covariance obtained by the Singular Value 
Decomposition behaves (see below), we perform the following 

' Note that there is a typing error in Anderson's book, where he 
gives an expression corresponding to 

C-i = " ~ P ~ ^ C:' forp<n-2. 
n - 2 
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experiment: First, we choose an analytical form for the popu- 
lation covariance Z. We use three different models: 

^f-cr^S,, (18) 
Hff = a-^ [l-i/{l+pi)]6ij and (19) 
I.'lf = a^/(l+e\i-j\) , (20) 

which initially are pi x pi matrices, e can be used to tune the 
degree of correlation in model ( |20] i: we choose e - 0.05. 

We then create n data vectors of length p[ according to 
rfW = f/j + yt** (Zi), where 7**'(Zi) is a noise vector drawn 
from a multivariate Gaussian distribution with mean zero and 
covariance Zj. The choice of the model vector m is arbitrary, 
and in fact for the present purpose it would be sufficient to 
set m -0. For later use, however, we choose the linear model 
nii = axi + b, where Xi — (jCmax - J^min)(' + ^/2)/pi is the value 
of the free variable corresponding to the centre of the i-th bin. 

From this synthetic set of observations we estimate the 
mean data vector and the covariance matrix, which yields the 
estimator &^^. Next, both Z and are inverted using the 
Singular Value Decomposition (see below). Finally, we com- 
pute the unbiased estimate C"' of the inverse covariance as 
given in (flTl l. 

To probe the dependence of the bias of the estimators 
for Z"', « new data vectors are created subsequently with 
Pi - Pi /J bins, for all integer j e [2, . . .,pi /2], where the pop- 
ulation covariance Tj for pj bins can be obtained from the orig- 
inal Zi by averaging over (j x j)-sub-blocks of Zi. This strat- 
egy of re-binning has the advantage that the true covariance is 
known exactly for all pj. 

Since the bias in Eq. (fTSI l is just a scalar factor, we record 
the traces of the estimators C;^' and C"' for each number of 
bins p. To improve our statistics, we repeat the procedure 
outlined above lO"* times and average over the traces computed 
in each step. 

In Fig. [T] we plot the ratios of the trace of Z ' to the traces 
of C;^' and C respectively. Not using the bias-corrected C ' 
can have considerable impact on the size of confidence regions 
of parameter estimates: for p < n - 2, the components of ' 
will be too large compared to those of the true inverse covari- 
ance, and the log-likelihood will decrease too steeply, resulting 
in confidence contours too small. 

We also plot the traces of C;^' for the different covariance 
models beyond p > n - I, where the estimator C^^ is singu- 
lar. These data points have been obtained using the Singular 
Value Decomposition to invert the covariance matrix, yielding 
a decomposition of the form 

C = UWV , (21) 

where U and V are orthogonal matrices and W is a diago- 
nal matrix containing the singular values. Since C is symmet- 
ric, one has in addition U = V, while W contains the moduli 
of the eigenvalues of C. The inverse of C is then given by 
C ' - VW 'U'. If C is singular, some of the entries of W will 
be zero or comparable to machine precision. We therefore can 
only compute a pseudo-inverse of C by replacing the inverses 
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of these singu l ar val ues i n W"' by zero, as has b een suggested 4.2. Measuring the size of confidence regions 



in 



Press et al.1 (1 19921) and lPan & Szapudil (12005). Fig. [T] shows 
that the bias of C~' in this regime depends significantly on the 
covariance model chosen and does not depend on binning in a 
simple way. Therefore we strongly discourage from the use of 
the SVDfor/5 >n - 1. 

4. Implications for likelihood analysis 

Having obtained an unbiased estimator of the inverse covari- 
ance matrix, one may still be concerned about a possible bias 
in the log-likelihood function, since it consists of the product 
of {d - fi) and C ' (Eq.[Tli, since fi and C ' are estimated from 
the same set of observations. In other words, the question is if 
it is possible to write 



U(dm = {(ji-mYC-Ujx-m)) 

= -i ({fi}-my{C-')({fi)-m) 



(22) 
(23) 



Luckily, this is indeed the case, since we have shown at the 
end of Sect. 12.21 that mean vector and covariance matrix are 
distributed independently. 



4.1. Marginalised likelHiood 

Usually, one is not only interested in the full parameter space 
of a problem, but also in values of single parameters and the 
corresponding errors. In the Bayesian framework, this is usu- 
ally achieved by marginalising over the "uninteresting" param- 
eters: the log-likelihood function X, for the single parameter tt, 
is computed using 



£i{d\7Ti) = log i 



n/ 



dir. 



expUidm] 



(24) 



There is no reason to believe that the marginahsed log- 
likelihood, which is a highly non-linear function of the (unbi- 
ased) estimate of the full log-likelihood, and with it the size of 
the errors on tti are unbiased, even if one uses the unbiased C"' . 
We demonstrate this by means of our simulated fitting proce- 
dure, where we now use in addition to the straight line model 
also a second simulation using a power-law model of the form 
m, - ax^. We marginalise over the intercept of the line and 
the power-law index, respectively. We record the sums over all 
pixels of the (one-dimensional) grid of the marginalised log- 
likelihood functions, which we compute using the true Z ' and 
the unbiased estimator C For Z, we choose the model ( fTsT l. 
We average over ^ 3 xlQ* repetitions of these experiments. 
We plot the ratio of true to estimated log-likelihood sums in 
Fig.|2](triangles and solid lines). The plot shows a bias of max- 
imally ^ 8% for the straight line and even less for the power- 
law, in a direction which would lead to an overestimation of 
the error bars on slope and amplitude. Although the effect is 
not very large, this is not guaranteed to remain so for models 
different from the ones considered here. 



For some applications, it is useful to have a simple measure of 
the size of the confidence regions. As an example, we make 
use of the Fisher information ma trix F (iFishenil935i) . which is 
defined by dTegmark et al 1 11 9971) 



F s 



(25) 



where the derivative is to be evaluated at the maximum- 
likelihood point ttq. F can be interpreted as an estimate of the 
inverse covariance matrix of the parameter estimates, provided 
X. is well approximated by a Gaussian around the maximum- 
likelihood point. 

To demonstrate the bias in Vdet F^', we compute the 
Fisher matrix for th e straight line and power-law fits using 
dTegmark et al.lll997h 



(26) 



a,B=\ 



which is valid if the covariance matrix does not depend on the 
parameters n,; m is the model vector 

In Fig. |2] we give the ratio of Vdet F"', computed using 
the unbiased estimated covariance C"', to the value computed 
using the true covariance (boxes and dashed lines). One sees 
that in this case the size of the confidence regions is signifi- 
cantly overestimated, for p/n approaching unity by as much as 
~ 30% for the straight line case, and by a comparable, albeit 
slightly smaller factor for the power-law fit. 

5. Bootstrapping and non-Gaussian statistics 

The derivation of the unbiased estimator C"' rests on the as- 
sumptions of Gaussian noise and statistically independent data 
vectors. To test the performance of this estimator in real world 
situations, where one or both of these assumptions may be vi- 
olated, we make use of an example from the domain of weak 
gravitational lensing. For an introduct i on to this field we refer 
the reader to B artelmann & Schneidei (2001). 

We simulate a weak lensing survey consisting of one single 
field, containing A^g galaxies, which are assigned a random el- 
lipticity £. e - ei + \ei is a complex number, which is related to 
the quadrupole moment of the U ght distribution of a galaxy (see 
Bartelmann & Schneiderl20"oi ). The two components of the el- 
lipticity are drawn from a Gaussian distribution with dispersion 
(rjV2. 

The goal of the survey is to measure the shear correlation 
function ^+(??) an d to fit a model prediction to it. An estimator 
for ^+ is given by dSchneid er et al. 2002) 



) A,,(|0, -0,|) 



(27) 



where the galaxies are labelled with / and j and have the an- 
gular positions 0, and 0j. A§{<p) is unity if i? - Ai?/2 < (p < 
§ + A-ff/l, where A§ is the bin width, and zero otherwise. 
Finally, np(§) is the number of pairs of galaxies contributing 
to the correlation function in the bin centred on §. 
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Fig. 2. Triangles, solid lines: Ratio of the sum over all pix- 
els of the marginalised likelihood computed using C ' and the 
true marginalised likelihood. Filled triangles are for the power- 
law fit (marginalised over the power-law index), open triangles 
are for the straight line fit (marginalised over the intercept). 
Squares, dashed lines: Ratio of Vdet F"' using C"' to the true 
one, computed with Z. For both cases I - I'^ '^. 

We also need the covariance matrix of ^+()?), which, since 
we only have one measu rement, is estimated using the boot- 
strapping algorithm (e.g. |Efron~& Tibshirani 1993): First, we 
create a catalogue of all A^p = Ng{Ng - l)/2 possible pairs of 
galaxies in the field. We then create A^b,s bootstrap realisations 
of the survey by repeatedly drawing A^p pairs with replacement 
from the catalogue. From these, we estimate the mean data vec- 
tor and the covariance matrix of the shear correlation function. 
As before, we do this for various numbers of bins, where we 
record the dependence of the traces of Z C;,:' and C ' on 
binning. For the simple case of pure shape noise, the popula- 
tion covariance is diag onal and can be easily computed using 
dSchneider et aljbooa 

2 np()7,) 

where §/ is the angular separation corresponding to the cen- 
tre of the i-th bin. We precompute the function Wp numerically 
from a large set of independent data fields for all binning pa- 
rameters we wish to use in the simulation. 

In principle, both of the assumptions made for the deriva- 
tion of Eq. ( fTTl l are violated: The noise in the shear correlation 
function is ;t'^-distributed, because ^+ oc ee, where e is drawn 
from a Gaussian. However, the number of degrees of freedom 
of the;if^-distribution, which equals the number of pairs, is very 
large, so that it is very well approximated by a Gaussian (cen- 
tral limit theorem). We therefore do not expect any significant 
influence on the performance of C ' . We expect a larger impact 
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Fig. 3. Ratio of the traces of the unbiased estimator C;^' and 
C ' to the trace of I ' ; the covariances for the solid curve have 
been estimated using bootstrapping (see text), the dashed line 
shows the ratio of the traces for log-normal errors. 

by the fact that the data vectors resulting from the bootstrapping 
procedure are not statistically independent, since different bins 
necessarily contain the identical galaxy pairs. Strictly speaking, 
also the requirements for the application of the bootstrap pro- 
cedure are not met, since the pairs of galaxies which we use to 
sample the distribution of the shear correlation function are not 
statistically independent. However, we argue that drawing indi- 
vidual galaxies instead of pairs is not correct, since this would 
sample the distribution of e, not the one of 

The outcome of « 2 x lO'* realisations of this experiment 
is given in Fig.[3](solid line), with A^g = 500 and A^bs = 40. The 
figure shows that, in spite of the correlations among the pairs 
of galaxies and the data vectors, C"' is wrong by only « 1%, 
and may well be used in bootstrap applications like this. 

Finally, we explore the impact of non-Gaussian noise. For 
this purpose, we perform the same experiment as before, only 
replacing the Gaussian noise vectors ones, y**^', with a log- 
normal distribution. These are computed using 

rW=exp(rf>-l/2) , (29) 

where r'*^ are vectors containing uncorrected, Gaussian ran- 
dom variables with mean zero and variance cr- = 1. The result 
is shown in Fig. [3] (dashed line). Clearly, Eq. (fTTT i is no longer 
applicable, although for p/n > 0.2, one still does much better 
with it than without it. 

6. Summary and conclusions 

We have given a proof for the fact that the standard estima- 
tor of the covariance matrix (|3]l is singular for p > n if the 
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mean vector used in ^ is known, and for p > n - I if the 
mean is estimated from the same data set as the covariance ma- 
trix. Furthermore, we noted that the inverse of the maximum- 
likeUhood estimator of the covariance matrix is a biased esti- 
mator of the inverse population covariance matrix. This bias de- 
pends basically on the ratio of the number of bins to the number 
of independent observations and can be quite severe as these 
two numbers become comparable. If uncorrected for, it will 
lead to a significant underestimation of the size of confidence 
regions derived from maximum-likelihood fits. The bias can 
be corrected for p < w - 2 by using the estimator (fTTl l instead, 
which was derived by lAndersonI (l2003b under the assumption 
of Gaussian errors and statistically independent data vectors. 
We stress that there is no contradiction between the foregoing 
two statements: The singularity of Cml for p > n - I derives 
from linear algebra alone, whereas the fact that C"' is zero for 
p - n - lis due to the statistical distribution of the covariance 
matrix. Going beyond p - n - l,we find that it is not advisable 
to use the Singular Value Decomposition to invert the estimated 
covariance matrix, since the bias of the pseudo-inverse does not 
seem to be controllable and depends strongly on the population 
covariance matrix, which is a-priori unknown. 

Given the unbiased estimator C we argue that also the 
log-likelihood function is unbiased. However, great care has to 
be taken if one wishes to perform further analysis of X.'. since 
it is a statistical variable, any nonlinear operation on it has the 
potential to cause a bias. We demonstrate this for the case of 
marginalisation over certain parameters, where the bias is rel- 
atively mild for the examples we chose. The situation is much 
worse if one tries to quantify the size of the confidence regions. 
The square root of the determinant of the inverse Fisher matrix 
shows a significant amount of bias, and therefore should not be 
used as absolute measures of measurement uncertainty. 

The upshot of all this is the following: avoid to use more 
bins for your likelihood fit than you have realisations of your 
data. If your errors are Gaussian and the data vectors are statis- 
tically independent, use the estimator C ' to obtain an unbiased 
estimate of the inverse covariance matrix and the log-likelihood 
function. If one or both of these two requirements are not ful- 
filled, the estimator is not guaranteed to work satisfactorily; this 
should be checked from case to case. 

Finally, we note that the estimates of the covariance ma- 
trix and its inverse can be quite noisy. If one has prior knowl- 
edge about the structure of the covariance matrix, one can de- 
velop estimators with a much lower noise level. Since this noise 
is responsible for most of the problems discussed in this pa- 
per, these improved estimators may also be useful in situations 
where the requirements for the use of C ' are not fulfilled. We 
will explore these possibilities in a future paper. 
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